Predicting Player Performance in Baseball using Hitting Data

James Rouse

Introduction:

This project revolves around predicting player performance over the current MLB season. Focusing on individual player hitting data, the goal is to build models that predict a player's potential contribution to scoring runs. The hypothetical scenario is that a team needs a hitter who can drive in runs.

Target Variable: Runs Batted In (RBIs) serves as the key performance metric. It signifies a player's impact on their team's runs.

Approach: The project spans data collection, preprocessing, exploratory data analysis (EDA), and model building. By parsing hitting data from the current season, key attributes were extracted, such as batting average, slugging percentage, home runs, and walks.

Building Predictive Models: Employing linear regression as well as other machine learning algorithms, models will be constructed to predict RBIs. Training, tuning, and evaluation will lead to models capable of forecasting player performance.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from pybaseball import batting_stats_range, team_batting
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_predict
from sklearn.tree import plot_tree
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import statsmodels.api as sm
pd.options.mode.chained_assignment = None

Data Collection: To start, data was fetched with pybaseball, a Python package that provides a convenient interface for accessing baseball-related data, statistics, and information from online sources. The code below uses the function batting_stats_range from pybaseball; the result is stored in a data frame called batting_data_df, which contains batting data for all MLB players this season.

In [2]:
# Get the current year
current_year = datetime.now().year
today_date = datetime.now().date()
today_str = today_date.strftime('%Y-%m-%d')

# Fetch batting statistics for the current season, from Opening Day through today
raw_data = batting_stats_range(start_dt=f'{current_year}-03-30', end_dt=today_str)
In [3]:
# Print the fetched data
batting_data_df = pd.DataFrame(raw_data)

batting_data_df.head()
Out[3]:
Name Age #days Lev Tm G PA AB R H ... SH SF GDP SB CS BA OBP SLG OPS mlbID
1 CJ Abrams 23 1 Maj-NL Washington 100 446 398 65 103 ... 1 2 2 20 10 0.259 0.333 0.460 0.792 682928
2 José Abreu 37 50 Maj-AL Houston 33 112 106 9 14 ... 0 1 4 0 0 0.132 0.170 0.208 0.377 547989
3 Wilyer Abreu 25 1 Maj-AL Boston 82 281 251 42 67 ... 0 3 4 7 2 0.267 0.335 0.486 0.821 677800
4 Ronald Acuña Jr. 26 67 Maj-NL Atlanta 48 217 188 37 46 ... 0 0 4 15 3 0.245 0.346 0.362 0.707 660670
5 Willy Adames 28 1 Maj-NL Milwaukee 107 464 408 57 102 ... 0 3 11 12 3 0.250 0.334 0.436 0.770 642715

5 rows × 28 columns

Data Cleaning: Ensuring Accurate Analysis

Baseball data often contains errors, missing values, and inconsistencies due to various data collection methods and sources. Cleaning the data removes inaccuracies, fixes missing values, and standardizes variables.

In [4]:
# Getting info about the columns in the dataframe
batting_data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 593 entries, 1 to 616
Data columns (total 28 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    593 non-null    object 
 1   Age     593 non-null    int64  
 2   #days   593 non-null    int64  
 3   Lev     593 non-null    object 
 4   Tm      593 non-null    object 
 5   G       593 non-null    int64  
 6   PA      593 non-null    int64  
 7   AB      593 non-null    int64  
 8   R       593 non-null    int64  
 9   H       593 non-null    int64  
 10  2B      593 non-null    int64  
 11  3B      593 non-null    int64  
 12  HR      593 non-null    int64  
 13  RBI     593 non-null    int64  
 14  BB      593 non-null    int64  
 15  IBB     593 non-null    int64  
 16  SO      593 non-null    int64  
 17  HBP     593 non-null    int64  
 18  SH      593 non-null    int64  
 19  SF      593 non-null    int64  
 20  GDP     593 non-null    int64  
 21  SB      593 non-null    int64  
 22  CS      593 non-null    int64  
 23  BA      589 non-null    float64
 24  OBP     589 non-null    float64
 25  SLG     589 non-null    float64
 26  OPS     589 non-null    float64
 27  mlbID   593 non-null    int64  
dtypes: float64(4), int64(21), object(3)
memory usage: 134.4+ KB

It looks like there are 4 rows with missing values: the BA, OBP, SLG, and OPS columns have 589 non-null entries out of 593. Since only 4 of the 593 rows are affected, these rows can simply be dropped.

In [5]:
# Dropping NA values
batting_data_df = batting_data_df.dropna()
In [6]:
batting_data_df.duplicated().sum()
Out[6]:
0

Because I am predicting RBIs, variables that won't be needed for the prediction can be removed. The player ID, days, and name columns, as well as base-running stats, aren't necessary for predicting runs batted in. The ID and name are just labels, and base running occurs after a batter is already on base, where they can't earn any RBIs.

In [7]:
# Dropping unnecessary columns
columns_to_drop = ['mlbID', 'CS', 'SB', '#days', 'Name']
batting_data_df = batting_data_df.drop(columns=columns_to_drop)

For consistency, ease of access, and clarity, the variable names will be converted to snake_case format.

In [8]:
# Creating a dictionary to map the data.
column_name_mapping = {
    'Age': 'age',
    'Lev': 'league',
    'Tm': 'team',
    'G': 'games',
    'PA': 'plate_appearances',
    'AB': 'at_bats',
    'R': 'runs',
    'H': 'hits',
    '2B': 'double',
    '3B': 'triple',
    'HR': 'home_run',
    'RBI': 'runs_batted_in',
    'BB': 'walk',
    'IBB': 'intentional_walk',
    'SO': 'strikeouts',
    'HBP': 'hit_by_pitch',
    'SH': 'sacrifice_hit',
    'SF': 'sacrifice_fly',
    'GDP': 'double_play',
    'BA': 'batting_average',
    'OBP': 'on_base_percentage',
    'SLG': 'slugging_percentage',
    'OPS': 'on_base_plus_slugging'
}

batting_data_df.rename(columns=column_name_mapping, inplace=True)

# Print the modified DataFrame
batting_data_df.columns
Out[8]:
Index(['age', 'league', 'team', 'games', 'plate_appearances', 'at_bats',
       'runs', 'hits', 'double', 'triple', 'home_run', 'runs_batted_in',
       'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
       'sacrifice_hit', 'sacrifice_fly', 'double_play', 'batting_average',
       'on_base_percentage', 'slugging_percentage', 'on_base_plus_slugging'],
      dtype='object')

Investigating the team column, since it is of type object, to see what it contains. Some players have multiple teams listed, separated by a comma. Only about 7% of players have played for multiple teams, and since a table will later be joined by team name, these players will simply be dropped.

In [9]:
# Inspecting the unique values of the 'team' column
batting_data_df['team'].unique()
Out[9]:
array(['Washington', 'Houston', 'Boston', 'Atlanta', 'Milwaukee',
       'Los Angeles', 'Los Angeles,San Francisco', 'Arizona',
       'Kansas City', 'Oakland', 'New York', 'Colorado', 'Chicago',
       'Miami', 'Tampa Bay', 'St. Louis', 'Cleveland',
       'Seattle,Tampa Bay', 'Miami,San Diego', 'San Diego', 'Detroit',
       'Pittsburgh', 'San Francisco', 'Baltimore', 'Toronto',
       'Cincinnati', 'Chicago,Miami', 'Los Angeles,Toronto', 'Seattle',
       'Philadelphia', 'Minnesota', 'St. Louis,Tampa Bay', 'Texas',
       'Miami,New York', 'Boston,Chicago', 'New York,Oakland',
       'Miami,Pittsburgh', 'Chicago,Kansas City', 'Houston,Oakland',
       'Cincinnati,Seattle', 'Los Angeles,Tampa Bay', 'Chicago,Texas',
       'Atlanta,Los Angeles', 'Baltimore,Philadelphia',
       'San Francisco,Texas', 'Boston,Toronto', 'Detroit,Texas',
       'Atlanta,Cleveland', 'Houston,Toronto', 'Baltimore,San Francisco',
       'Atlanta,Philadelphia', 'Chicago,Tampa Bay', 'Chicago,New York',
       'Chicago,St. Louis', 'Chicago,Los Angeles', 'Tampa Bay,Washington',
       'Seattle,Washington', 'Atlanta,Washington', 'Chicago,Washington',
       'Atlanta,Boston,New York', 'Cincinnati,San Francisco',
       'Atlanta,San Francisco', 'Baltimore,Miami', 'Cleveland,Washington',
       'Los Angeles,New York', 'Seattle,Toronto', 'Atlanta,Miami',
       'New York,Washington'], dtype=object)
In [10]:
# Determining how many values in the 'team' column have multiple teams
count_with_comma = batting_data_df[batting_data_df['team'].str.contains(',')].shape[0]

# Count the total number of players
total_players = batting_data_df.shape[0]

# Calculate the proportion
proportion_with_comma = (count_with_comma / total_players) * 100
print(f"Proportion of players with comma in team name (traded players): {proportion_with_comma:.2f}%")
Proportion of players with comma in team name (traded players): 7.47%
In [11]:
# Get value counts of all teams
batting_data_df['team'].value_counts()
Out[11]:
Chicago                39
Los Angeles            34
New York               33
Cincinnati             25
Seattle                21
                       ..
Chicago,Texas           1
Atlanta,Los Angeles     1
San Francisco,Texas     1
Boston,Toronto          1
New York,Washington     1
Name: team, Length: 68, dtype: int64

As stated earlier, the data contains players who belong to multiple teams; these are players who have likely been traded mid-season. Because traded players make up only a small proportion of the data, they will simply be omitted.

In [12]:
# Removing data where the team has a comma.
batting_data_df = batting_data_df[~batting_data_df['team'].str.contains(',')]
batting_data_df['team'].unique()
Out[12]:
array(['Washington', 'Houston', 'Boston', 'Atlanta', 'Milwaukee',
       'Los Angeles', 'Arizona', 'Kansas City', 'Oakland', 'New York',
       'Colorado', 'Chicago', 'Miami', 'Tampa Bay', 'St. Louis',
       'Cleveland', 'San Diego', 'Detroit', 'Pittsburgh', 'San Francisco',
       'Baltimore', 'Toronto', 'Cincinnati', 'Seattle', 'Philadelphia',
       'Minnesota', 'Texas'], dtype=object)

It is also clear that there is no differentiation between teams in the same city. To fix this, the league column is checked to determine which team the player is actually on. For example, if a player's team is listed as New York but they play in the American League, the team name is set to New York Yankees instead of just New York.

In [13]:
# Correcting team names for cities with multiple teams
conditions = [
    (batting_data_df['team'] == 'New York') & (batting_data_df['league'] == 'Maj-AL'),
    (batting_data_df['team'] == 'New York') & (batting_data_df['league'] == 'Maj-NL'),
    (batting_data_df['team'] == 'Los Angeles') & (batting_data_df['league'] == 'Maj-NL'),
    (batting_data_df['team'] == 'Los Angeles') & (batting_data_df['league'] == 'Maj-AL'),
    (batting_data_df['team'] == 'Chicago') & (batting_data_df['league'] == 'Maj-NL'),
    (batting_data_df['team'] == 'Chicago') & (batting_data_df['league'] == 'Maj-AL')
]

choices = ['New York Yankees', 'New York Mets', 'Los Angeles Dodgers', 'Los Angeles Angels', 'Chicago Cubs', 'Chicago White Sox']

batting_data_df['team'] = np.select(conditions, choices, default=batting_data_df['team'])
batting_data_df.head()
Out[13]:
age league team games plate_appearances at_bats runs hits double triple ... intentional_walk strikeouts hit_by_pitch sacrifice_hit sacrifice_fly double_play batting_average on_base_percentage slugging_percentage on_base_plus_slugging
1 23 Maj-NL Washington 100 446 398 65 103 23 6 ... 3 88 12 1 2 2 0.259 0.333 0.460 0.792
2 37 Maj-AL Houston 33 112 106 9 14 2 0 ... 0 26 2 0 1 4 0.132 0.170 0.208 0.377
3 25 Maj-AL Boston 82 281 251 42 67 24 2 ... 0 77 1 0 3 4 0.267 0.335 0.486 0.821
4 26 Maj-NL Atlanta 48 217 188 37 46 8 1 ... 0 52 3 0 0 4 0.245 0.346 0.362 0.707
5 28 Maj-NL Milwaukee 107 464 408 57 102 25 0 ... 2 110 1 0 3 11 0.250 0.334 0.436 0.770

5 rows × 23 columns

Implementing a threshold for minimum plate appearances to filter out players with limited or insufficient data.

In [14]:
# Setting a threshold for minimum plate appearance
batting_data_df = batting_data_df[batting_data_df['plate_appearances'] >= 150]

Exploratory Data Analysis

Graphs depicting the correlates of runs batted in are generated below, exploring its associations with factors such as age, team affiliation, and total hits.

In [15]:
# Scatter plot: RBI vs Hits
plt.figure(figsize=(10, 6))
sns.regplot(data=batting_data_df, x='hits', y='runs_batted_in')
plt.xlabel('Hits')
plt.ylabel('RBIs')
plt.title('Scatter Plot: RBIs vs Hits')
plt.show()

# Box plot: RBIs by Age
plt.figure(figsize=(10, 6))
sns.boxplot(data=batting_data_df, x='age', y='runs_batted_in')
plt.xlabel('Age')
plt.ylabel('RBIs')
plt.title('Box Plot: RBIs by Age')
plt.show()

# Correlation heatmap
correlation_matrix = batting_data_df.corr(numeric_only = True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Histogram: Distribution of RBIs
plt.figure(figsize=(10, 6))
sns.histplot(data=batting_data_df, x='runs_batted_in', bins=20)
plt.xlabel('RBIs')
plt.ylabel('Frequency')
plt.title('Distribution of RBIs')
plt.show()

# Bar plot: RBIs by Team
plt.figure(figsize=(12, 6))
sns.barplot(data=batting_data_df, x='team', y='runs_batted_in')
plt.xlabel('Team')
plt.ylabel('Average RBIs')
plt.title('Bar Plot: Average RBIs by Team')
plt.xticks(rotation=90)
plt.show()

The exploration of the relationship between Runs Batted In (RBIs) and Hits reveals a notable trend, suggesting a positive correlation between these two variables. As the number of hits increases, players tend to achieve higher RBIs, indicating a connection between their ability to connect with the ball and their effectiveness in driving runs. However, while this positive correlation is apparent, it is essential to consider the distribution of RBIs. A slight right skewness is observed in the distribution, implying that there are more players with fewer RBIs and relatively fewer players with higher RBIs. This skewness may potentially be mitigated through the removal of outliers, which could influence the distribution's shape.

Further analysis of the relationship between age and RBIs indicates a lack of a clear trend between the two variables. Age does not seem to be a significant predictor of RBIs, suggesting that a player's effectiveness in driving runs is not solely influenced by their age. This observation underscores the complex interplay of various factors contributing to a player's performance in this aspect of the game.

Being on different teams appears to impact a player's total RBI stats. To account for this, the team variable will either be dummy-encoded, or another dataframe with team stats will be merged for feature engineering.
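The dummy-encoding option mentioned above could be sketched with pandas' get_dummies; the tiny frame below is a hypothetical stand-in for batting_data_df, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical mini-frame standing in for batting_data_df
df = pd.DataFrame({
    'team': ['Washington', 'Houston', 'Washington'],
    'runs_batted_in': [50, 60, 40],
})

# One-hot encode the team column; drop_first=True avoids the dummy-variable trap
encoded = pd.get_dummies(df, columns=['team'], drop_first=True)
print(encoded.columns.tolist())
```

Each remaining team becomes its own 0/1 indicator column, which a linear model can use directly without imposing an artificial ordering on teams.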

When examining the heatmap, one notable finding is the dependence of a player's Runs Batted In (RBIs) on the number of games played and their plate appearances. Additionally, there is a strong link between RBIs and strikeouts. This may seem counterintuitive, but it is not as complex as it appears. When a player has more plate appearances, they tend to accumulate more RBIs and more strikeouts, making these two numbers appear connected.

However, considering whether a high number of strikeouts indicates productivity is crucial. Normalization can address this by dividing each statistic by plate appearances, creating a more balanced perspective. This method allows teams to assess a player's expected performance per at-bat, providing a clearer understanding of their effectiveness.

Correlation Insights: Cumulative Stats and RBIs

Upon inspecting the heatmap of correlations, a notable pattern emerges between RBIs and cumulative statistics, including strikeouts. This relationship stems from the inherent connection between cumulative stats and the number of opportunities for RBIs. Players with more at-bats tend to accumulate more RBIs, a logical outcome considering the increased chances to drive in runs.

However, the presence of high RBIs correlated with higher strikeouts also underscores the nuanced nature of this relationship. While accumulating RBIs is a positive indicator of offensive contributions, an increased frequency of strikeouts may not be as desirable. Striking out is often associated with missed scoring opportunities, highlighting the balance between power hitting and plate discipline.

To ensure a fair assessment of player performance, it becomes imperative to normalize these statistics per at-bat. This approach levels the playing field, allowing us to evaluate how efficiently a player generates RBIs relative to their plate appearances. By normalizing per at-bat, we mitigate the influence of sheer volume and gain a more accurate perspective on a player's RBI efficiency, unearthing the true impact of their offensive prowess.

In [16]:
# Inspecting the columns of the data frame
batting_data_df.columns
Out[16]:
Index(['age', 'league', 'team', 'games', 'plate_appearances', 'at_bats',
       'runs', 'hits', 'double', 'triple', 'home_run', 'runs_batted_in',
       'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
       'sacrifice_hit', 'sacrifice_fly', 'double_play', 'batting_average',
       'on_base_percentage', 'slugging_percentage', 'on_base_plus_slugging'],
      dtype='object')
In [17]:
# Normalizing variables
variables_to_normalize = ['hits', 'double', 'triple', 'home_run', 'runs_batted_in', 
                          'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
                          'sacrifice_hit', 'sacrifice_fly', 'double_play']

# Iterate through the variables and add normalized columns, then drop the original columns
for var in variables_to_normalize:
    normalized_col_name = f'{var}_per_pa'
    batting_data_df[normalized_col_name] = batting_data_df[var] / batting_data_df['plate_appearances']
    batting_data_df.drop(columns=[var], inplace=True)
In [18]:
# Scatter plot: RBI vs Hits
plt.figure(figsize=(10, 6))
sns.regplot(data=batting_data_df, x='hits_per_pa', y='runs_batted_in_per_pa')
plt.xlabel('Hits per Plate Appearance')
plt.ylabel('RBIs per Plate Appearance')
plt.title('Scatter Plot: RBIs vs Hits (Normalized)')
plt.show()

# Box plot: RBIs by Age
plt.figure(figsize=(10, 6))
sns.boxplot(data=batting_data_df, x='age', y='runs_batted_in_per_pa')
plt.xlabel('Age')
plt.ylabel('RBIs per Plate Appearance')
plt.title('Box Plot: RBIs by Age (Normalized)')
plt.ylim(-.001, 0.3)  # Set y-axis limit to 0.3
plt.show()

# Correlation heatmap
correlation_matrix = batting_data_df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Histogram: Distribution of RBIs
plt.figure(figsize=(10, 6))
sns.histplot(data=batting_data_df, x='runs_batted_in_per_pa', bins=20)
plt.xlabel('RBIs per Plate Appearance')
plt.ylabel('Frequency')
plt.title('Distribution of RBIs (Normalized)')
plt.show()

# Bar plot: RBIs by Team
plt.figure(figsize=(12, 6))
sns.barplot(data=batting_data_df, x='team', y='runs_batted_in_per_pa')
plt.xlabel('Team')
plt.ylabel('Average RBIs per Plate Appearance')
plt.title('Bar Plot: Average RBIs by Team (Normalized)')
plt.xticks(rotation=90)
plt.show()

A closer examination of the data, specifically after normalizing the variables by plate appearances, revealed some interesting insights. The connections between most variables and Runs Batted In (RBIs) became less pronounced. This initially surprising observation has a straightforward explanation.

Normalizing the variables by plate appearances means evaluating performance "per turn at bat." By doing so, the analysis considers a player's performance relative to their opportunities to hit. Before normalization, players with more plate appearances naturally accumulated more RBIs and other stats such as hits, walks, and strikeouts. This made these numbers appear more interconnected than they truly were.

With normalization, the analysis levels the playing field, allowing for a fair comparison of players based on their performance in each turn at bat. Players with fewer plate appearances but strong performance in each opportunity stand out more clearly. This method removes the skew caused by varying numbers of plate appearances and highlights each player's efficiency.

Interestingly, normalizing the data also led to a more normal distribution of RBIs per plate appearance. The distribution began to resemble a more balanced and symmetric shape, indicating that when plate appearances are accounted for, RBIs per plate appearance align more closely with a typical distribution.

To prepare for building models, it's important to select the right features to include. A useful step in this process is to visualize how each feature relates to the target variable using regression plots. This approach helps to identify features with strong relationships to the target, which may be valuable for model development.

In [19]:
# Get the list of numeric column names
numeric_column_names = batting_data_df.select_dtypes(include='number').columns.tolist()

# Create subplots for each numeric column
num_rows = (len(numeric_column_names) + 2) // 3
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))

# Flatten the axes if there's only one row
if num_rows == 1:
    axes = [axes]

# Create scatter plots with regression lines for numeric columns
for i, var in enumerate(numeric_column_names):
    row = i // 3
    col = i % 3
    ax = axes[row][col]
    
    sns.regplot(data=batting_data_df, x=var, y='runs_batted_in_per_pa', ax=ax, color='blue', scatter_kws={'alpha': 0.5})
    ax.set_xlabel(var)
    ax.set_ylabel('RBIs per Plate Appearance')
    ax.set_title(f'{var} vs. RBIs')

# Adjust layout
plt.tight_layout()
plt.show()

The variables 'runs,' 'batting average,' 'slugging percentage,' 'on-base percentage,' 'doubles per plate appearance,' 'triples per plate appearance,' 'hits per plate appearance,' 'home runs per plate appearance,' and 'walks per plate appearance' have shown the strongest linear associations with 'RBIs per plate appearance' in the analysis. Given their significant relationships with the target variable, these variables have been selected for the regression model. Including these key variables will help capture the main factors that influence 'RBIs per plate appearance.' This careful selection ensures that the model will provide meaningful insights and accurate predictions about the relationship between these variables and the target variable.

Outliers for the target variable are removed prior to regression analysis. Outliers are extreme data points that can distort results and decrease model accuracy. Removing them helps the model perform better by focusing on typical data points, leading to a more reliable analysis that accurately reflects the relationships between variables. The criterion used for identifying outliers is lying more than 3 standard deviations from the mean.
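The 3-standard-deviation rule described above isn't shown explicitly in the notebook's cells, so here is a minimal sketch of how it could be applied; the synthetic column is a stand-in for runs_batted_in_per_pa.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the target column, with one injected outlier
rng = np.random.default_rng(0)
df = pd.DataFrame({'runs_batted_in_per_pa': rng.normal(0.12, 0.03, 300)})
df.loc[0, 'runs_batted_in_per_pa'] = 0.9  # obvious outlier

mean = df['runs_batted_in_per_pa'].mean()
std = df['runs_batted_in_per_pa'].std()

# Keep only rows within 3 standard deviations of the mean
mask = (df['runs_batted_in_per_pa'] - mean).abs() <= 3 * std
filtered = df[mask]
print(len(df) - len(filtered))  # number of rows dropped
```

The same mask applied to the real frame would drop any player whose RBIs per plate appearance sits in the extreme tails.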

Preparing the Linear Regression Model

Linearity is one of the assumptions of linear regression. Because of this, team may be better added as a categorical variable rather than as merged team stats, which don't appear to be linear with the target. Variables that do appear linear are runs, batting average, slugging percentage, on-base percentage, doubles, home runs, hits, and walks. Unfortunately, the merged team variables don't look linear with RBIs per plate appearance, so they won't be included in the regression.

Linear regression is conducted below, and the first step is to scale the data using the min-max scaler. After scaling, the coefficients of the predictor variables represent the change in the target variable (here, RBIs per plate appearance) for a one-unit change in the predictor, holding the other predictors constant. With all variables on the same scale, the coefficients can be directly compared to understand the relative influence of each predictor on the target. Next, the model is evaluated with k-fold cross-validation, which yields averaged R-squared and MSE values for a more reliable assessment.

Linear Regression Model

In [20]:
# Constructing first linear regression model
predictor_vars = ['runs', 'batting_average', 'slugging_percentage', 'on_base_percentage', 'double_per_pa', 'triple_per_pa', 'hits_per_pa', 'home_run_per_pa', 'walk_per_pa']
X = batting_data_df[predictor_vars]

scaler = MinMaxScaler()
X[predictor_vars] = scaler.fit_transform(X[predictor_vars])

y = batting_data_df[['runs_batted_in_per_pa']]

# Initialize k-fold cross-validation
num_folds = 5  
kf = KFold(n_splits=num_folds, shuffle=True, random_state=25)

mse_scores = []
r2_scores = []


residuals_list = []
y_pred_list = []

for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred_val = model.predict(X_val)

    # Calculate residuals for this fold
    residuals = y_val - y_pred_val
    residuals_list.append(residuals)
    y_pred_list.append(y_pred_val)

    mse = mean_squared_error(y_val, y_pred_val)
    r2 = r2_score(y_val, y_pred_val)

    mse_scores.append(mse)
    r2_scores.append(r2)

avg_mse = np.mean(mse_scores)
avg_r2 = np.mean(r2_scores)

print(f"Average Mean Squared Error: {avg_mse}")
print(f"Average R-squared: {avg_r2}")

import statsmodels.api as sm

# Add a constant column to X matrix for intercept
X = sm.add_constant(X)

# Create a linear regression model
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

# Print the summary
print(results.summary())
Average Mean Squared Error: 0.00031177284840925933
Average R-squared: 0.6167006702681447
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     runs_batted_in_per_pa   R-squared:                       0.668
Model:                               OLS   Adj. R-squared:                  0.659
Method:                    Least Squares   F-statistic:                     68.52
Date:                   Thu, 01 Aug 2024   Prob (F-statistic):           4.54e-68
Time:                           23:12:05   Log-Likelihood:                 840.03
No. Observations:                    316   AIC:                            -1660.
Df Residuals:                        306   BIC:                            -1622.
Df Model:                              9                                         
Covariance Type:               nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   0.0702      0.013      5.206      0.000       0.044       0.097
runs                   -0.0161      0.007     -2.467      0.014      -0.029      -0.003
batting_average         0.5763      0.157      3.664      0.000       0.267       0.886
slugging_percentage    -0.2393      0.218     -1.097      0.273      -0.668       0.190
on_base_percentage     -0.1421      0.042     -3.411      0.001      -0.224      -0.060
double_per_pa           0.0664      0.038      1.755      0.080      -0.008       0.141
triple_per_pa           0.0304      0.027      1.113      0.267      -0.023       0.084
hits_per_pa            -0.3794      0.087     -4.344      0.000      -0.551      -0.207
home_run_per_pa         0.2612      0.122      2.143      0.033       0.021       0.501
walk_per_pa            -0.0096      0.019     -0.496      0.620      -0.048       0.029
==============================================================================
Omnibus:                        3.424   Durbin-Watson:                   2.107
Prob(Omnibus):                  0.181   Jarque-Bera (JB):                3.503
Skew:                           0.241   Prob(JB):                        0.173
Kurtosis:                       2.818   Cond. No.                         476.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Evaluating the outcomes of the linear regression, the cross-validated R-squared is about 0.62 (the in-sample OLS R-squared is 0.67). This indicates that roughly 62% of the variation in the target variable can be explained by the predictor variables in the model. Analyzing the coefficients, several insights emerge: while the intercept is significant, the coefficients for individual variables display varying degrees of influence.

Interestingly, the coefficient for 'hits_per_pa' emerges as the most negative. This might seem counterintuitive at first glance, but it can be attributed to the multicollinearity present in the model. The coefficient for 'hits_per_pa' is influenced by the strong positive coefficient for 'batting_average,' as a higher batting average often translates to more hits. This interdependence suggests that 'hits_per_pa' may not be offering distinct predictive power beyond what 'batting_average' provides.

This introduces a layer of complexity when interpreting the coefficients. Multicollinearity among variables, such as 'hits_per_pa' and 'batting_average,' can make it challenging to disentangle the individual contributions of each variable to the model. In such cases, the coefficients might not accurately reflect the true impact of each variable on the target.

Given this multicollinearity, consideration should be given to potentially removing certain variables from the model. Variables that are strongly correlated or redundant could be excluded to enhance the model's interpretability without compromising its predictive power. This step aligns with the goal of crafting a model that not only predicts well but also provides insights that can be readily understood and applied in the context of baseball analytics.

Linear Regression Assumptions: Ensuring Solid Foundations

Linearity: checked earlier; only variables with a linear relationship to the target were used. Normality: the Q-Q plot isn't perfectly linear, but the histogram of the residuals looks approximately normally distributed, so the assumption of normality is reasonably met.

In [21]:
# Combine residuals from all folds
all_residuals = np.concatenate(residuals_list)

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Plot histogram of residuals
ax1.hist(all_residuals, bins=20, edgecolor='k')
ax1.set_xlabel('Residuals')
ax1.set_ylabel('Frequency')
ax1.set_title('Histogram of Residuals')

# Plot Q-Q plot of residuals
sm.qqplot(all_residuals, line='s', ax=ax2)  # 's' for standard normal distribution line
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Residuals')

plt.tight_layout()
plt.show()

Homoscedasticity (equal variances): A plot of residuals versus fitted values was generated, revealing a scatterplot resembling a random cloud. In such a plot, if the residuals are scattered evenly around the horizontal line at zero, it indicates that the assumption of constant variance (homoscedasticity) is likely met.

In [22]:
# Create residuals vs. fitted values plot
all_y_pred = np.concatenate(y_pred_list)

plt.figure(figsize=(8, 6))
plt.scatter(all_y_pred, all_residuals, color='blue', alpha=0.7)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values Plot')
plt.show()

Multicollinearity: The presence of multicollinearity in the analysis is not surprising, especially within the context of baseball statistics. Multicollinearity occurs when predictor variables in a regression model are highly correlated with each other. In baseball, many statistics are intricately linked, as different aspects of a player's performance often feed off one another. For example, a player with a high batting average is likely to have more hits, and consequently, more runs batted in (RBIs).

In [23]:
# Define the predictor variables you want to plot
predictor_vars = ['runs', 'batting_average', 'slugging_percentage', 'on_base_percentage', 'double_per_pa', 'triple_per_pa', 'hits_per_pa', 'home_run_per_pa', 'walk_per_pa']

# Create subplots for each numeric column
num_rows = (len(predictor_vars) + 2) // 3
fig, axes = plt.subplots(num_rows, 3, figsize=(15, 5 * num_rows))

# Flatten the axes if there's only one row
if num_rows == 1:
    axes = [axes]

# Create scatter plots with regression lines for predictor variables
for i, var in enumerate(predictor_vars):
    row = i // 3
    col = i % 3
    ax = axes[row][col]
    
    sns.regplot(data=batting_data_df, x=var, y='runs_batted_in_per_pa', ax=ax, color='blue', scatter_kws={'alpha': 0.5})
    ax.set_xlabel(var)
    ax.set_ylabel('RBIs per Plate Appearance')
    ax.set_title(f'{var} vs. RBIs')

# Adjust layout
plt.tight_layout()
plt.show()

In baseball, it's common to encounter significant multicollinearity among variables. Many statistics are interrelated, creating challenges when interpreting coefficients in a regression model. When variables are strongly correlated, isolating their individual effects on the target variable becomes difficult. For example, the model's coefficients might show hits negatively correlated with RBIs, likely due to compensation by the high positive value for batting average.

To address this issue, some features will be strategically removed from the model to reduce multicollinearity. This approach aims to untangle the complex relationships between variables, enhancing the understanding of how each feature contributes to the target variable's variation. This step is crucial for producing a more transparent and accurate regression analysis, providing meaningful insights into the factors driving the outcomes.

Since batting average and walks overlap heavily with on-base percentage, and triples and doubles overlap heavily with slugging percentage, these will be consolidated into on-base percentage and slugging percentage. This consolidation will help reduce redundancy and improve the model's interpretability.
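A quick way to verify this overlap before consolidating features is to inspect pairwise correlations. The sketch below uses synthetic per-PA columns constructed to mimic the overlap, not the real `batting_data_df`; in the notebook the same `.corr()` call would be made on the actual frame.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data mimicking the overlap between extra-base
# hit rates and slugging percentage.
rng = np.random.default_rng(1)
n = 200
double_per_pa = rng.normal(0.045, 0.01, n)
triple_per_pa = rng.normal(0.005, 0.003, n)
home_run_per_pa = rng.normal(0.03, 0.01, n)
# Slugging is (roughly) a weighted sum of extra-base hits, so build it that way.
slugging_percentage = (0.25 + 2 * double_per_pa + 3 * triple_per_pa
                       + 4 * home_run_per_pa + rng.normal(0, 0.01, n))

df = pd.DataFrame({
    'double_per_pa': double_per_pa,
    'triple_per_pa': triple_per_pa,
    'home_run_per_pa': home_run_per_pa,
    'slugging_percentage': slugging_percentage,
})

# Pairwise Pearson correlations with slugging; strongly correlated
# columns are candidates for consolidation into the composite metric.
corr = df.corr()['slugging_percentage'].drop('slugging_percentage')
print(corr.sort_values(ascending=False))
```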

Combating Multicollinearity: Linear Regression Model 2

In [24]:
# Constructing another Linear Regression Model with different features
predictor_vars_2 = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = batting_data_df[predictor_vars_2]
y = batting_data_df['runs_batted_in_per_pa']

scaler = MinMaxScaler()
X[predictor_vars_2] = scaler.fit_transform(X[predictor_vars_2])

# Add a constant column to X matrix for intercept
X = sm.add_constant(X)

# Perform k-fold cross-validation
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=1)

mse_scores = []
r2_scores = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model = sm.OLS(y_train, X_train)
    results = model.fit()
    
    y_pred = results.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    mse_scores.append(mse)
    r2_scores.append(r2)

# Calculate average MSE and R-squared across folds
avg_mse = np.mean(mse_scores)
avg_r2 = np.mean(r2_scores)

print(f"Average Mean Squared Error: {avg_mse}")
print(f"Average R-squared: {avg_r2}")

predictor_vars_2 = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = batting_data_df[predictor_vars_2]
y = batting_data_df['runs_batted_in_per_pa']
# Add a constant column to X matrix for intercept
X = sm.add_constant(X)

# Create a linear regression model
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

# Print the summary
print(results.summary())
Average Mean Squared Error: 0.0003210269003309013
Average R-squared: 0.6183617358788298
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     runs_batted_in_per_pa   R-squared:                       0.637
Model:                               OLS   Adj. R-squared:                  0.633
Method:                    Least Squares   F-statistic:                     136.7
Date:                   Thu, 01 Aug 2024   Prob (F-statistic):           3.04e-67
Time:                           23:12:07   Log-Likelihood:                 825.95
No. Observations:                    316   AIC:                            -1642.
Df Residuals:                        311   BIC:                            -1623.
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   0.0293      0.009      3.331      0.001       0.012       0.047
runs                   -0.0002   8.31e-05     -2.863      0.004      -0.000   -7.44e-05
on_base_percentage     -0.1021      0.046     -2.228      0.027      -0.192      -0.012
slugging_percentage     0.2626      0.038      6.940      0.000       0.188       0.337
home_run_per_pa         0.6602      0.137      4.816      0.000       0.390       0.930
==============================================================================
Omnibus:                        2.183   Durbin-Watson:                   2.137
Prob(Omnibus):                  0.336   Jarque-Bera (JB):                2.207
Skew:                           0.200   Prob(JB):                        0.332
Kurtosis:                       2.913   Cond. No.                     5.77e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.77e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Comparing the two sets of predictor variables reveals an interesting pattern. The second model, which includes fewer variables, exhibits very similar performance to the first one in explaining RBIs per plate appearance. This similarity suggests that these specific variables have a significant impact on predicting RBIs.

One of the standout findings is the importance of home runs in driving RBIs. The positive coefficient attached to home runs confirms their role in boosting RBI counts. This aligns with the common understanding that home runs often lead to multiple runs being scored, thus elevating a player's RBI tally.

An intriguing twist comes from the negative coefficient tied to on-base percentage. While initially puzzling, this can be explained by the different types of players. Those who emphasize on-base percentage—usually contact hitters—are less likely to hit home runs. Given that home runs seem to strongly contribute to RBIs, this leads to a lower RBI per plate appearance for contact hitters, potentially resulting in the observed negative coefficient for on-base percentage.

Looking at the model's performance, an R-squared value of 0.62 indicates that around 62% of the variation in RBIs per plate appearance can be explained by the variables used in the model. This suggests that while these variables provide significant insights into RBI production, other factors also play a role in the remaining variability.

The Mean Squared Error (MSE) of 0.00032, though seemingly small, needs context. Since the target variable, RBIs per plate appearance, is typically a small value below 1, squaring it for MSE calculation makes it even smaller. This highlights how the scale of the target variable influences the scale of the MSE value.
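One way to put the MSE back on the scale of the target is to take its square root. The sketch below uses the reported cross-validated MSE from above; the 600-plate-appearance season is an illustrative assumption, not a figure from the data.

```python
import math

# Reported cross-validated MSE for the second linear model (from above).
mse = 0.00032

# RMSE is on the same scale as the target (RBIs per plate appearance),
# which makes the error easier to interpret.
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.4f}")

# Over an assumed full season of ~600 plate appearances, that per-PA
# error corresponds to roughly this many RBIs.
rbi_error_per_season = rmse * 600
print(f"Approx. RBI error over 600 PA: {rbi_error_per_season:.1f}")
```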

This model shows how different variables affect RBIs per plate appearance. The straightforward coefficients, the moderate R-squared value, and the scaled MSE offer a clear view of how these variables interact within the realm of baseball performance.

Random Forest Machine Learning Model: Random Forest is a machine learning algorithm that creates an ensemble of decision trees to make predictions. It aggregates the predictions of multiple trees to arrive at a more accurate and robust final prediction. In the context of baseball stats, Random Forest can be employed to predict outcomes like RBIs per plate appearance based on various player statistics. By harnessing the power of multiple decision trees and their collective insights, we can better capture the intricate relationships between different baseball metrics and enhance the understanding of factors influencing player performance.

In [25]:
# Split the data into training and testing sets for Random Forest Regressor model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [25, 50, 100, 150, 200],  
    'max_depth': [None, 5, 10, 20],  
    'min_samples_split': [2, 5, 10, 20]  
}

# Initialize GridSearchCV with the Random Forest model and parameter grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best model and its hyperparameters
best_rf_model = grid_search.best_estimator_

# Make predictions on the test data using the best model
y_pred = best_rf_model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best Model Parameters: {grid_search.best_params_}")
print(f"Mean Squared Error with Best Model: {mse}")
print(f"R-squared with Best Model: {r2}")

# Choose an arbitrary tree from the Random Forest (e.g., first tree)
chosen_tree = best_rf_model.estimators_[0]

plt.figure(figsize=(20, 10))
plot_tree(chosen_tree, feature_names=X_train.columns, filled=True, rounded=True)
plt.show()

feature_importances = best_rf_model.feature_importances_
sorted_indices = np.argsort(feature_importances)[::-1]  # Sort indices in descending order

for idx in sorted_indices:
    print(f"Feature: {X_train.columns[idx]}, Importance: {feature_importances[idx]}")
Best Model Parameters: {'max_depth': 5, 'min_samples_split': 20, 'n_estimators': 200}
Mean Squared Error with Best Model: 0.00041698845635485514
R-squared with Best Model: 0.500431181932357
Feature: slugging_percentage, Importance: 0.5153807953312785
Feature: home_run_per_pa, Importance: 0.42584553349050674
Feature: on_base_percentage, Importance: 0.031954056660413
Feature: runs, Importance: 0.02681961451780184
Feature: const, Importance: 0.0

In exploring predictive modeling techniques, it was observed that the Random Forest model demonstrated slightly worse R-squared and Mean Squared Error (MSE) metrics compared to the Linear Regression model. While this suggests that the Random Forest model might have a worse overall fit to the data, it is important to note the distinction between these evaluation approaches.

The R-squared and MSE values provide insights into model performance, but the context of model selection and hyperparameter tuning is crucial. For the Random Forest model, GridSearchCV was employed. This method exhaustively searches through various combinations of hyperparameters to identify the optimal configuration that minimizes the MSE. This process not only accounts for the individual folds in cross-validation but also selects the best-performing model based on the aggregate evaluation metrics.

In contrast, K-fold cross-validation applied to the Linear Regression model provides an average R-squared and MSE across multiple folds. This offers an effective assessment of the model's general performance but does not specifically target the best combination of hyperparameters as GridSearchCV does. Instead, it evaluates the model's consistency and overall predictive ability across different subsets of the data.

Thus, while the Random Forest model showcased slightly worse R-squared and MSE metrics, the true strength of GridSearchCV lies in its capability to pinpoint the most optimal model configuration based on these performance metrics. This distinction highlights the nuanced interplay between model selection, hyperparameter tuning, and evaluation approaches within predictive modeling. Therefore, it is not accurate to claim that one model is definitively better than the other based solely on R-squared and MSE metrics.
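One hedged way to put both models on a truly equal footing is to score them with the exact same cross-validation splits. The sketch below illustrates the pattern with synthetic data from `make_regression`; the feature matrix and target are stand-ins, not the actual batting data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the batting features.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# The same KFold object (same splits) scores both models, so the
# comparison is not confounded by different evaluation protocols.
cv = KFold(n_splits=5, shuffle=True, random_state=1)

lin_mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring='neg_mean_squared_error')
rf_mse = -cross_val_score(RandomForestRegressor(random_state=42), X, y, cv=cv,
                          scoring='neg_mean_squared_error')

print(f"Linear regression CV MSE: {lin_mse.mean():.2f}")
print(f"Random forest CV MSE:     {rf_mse.mean():.2f}")
```

On this purely linear synthetic data the linear model wins, which is a reminder that such comparisons depend on the data as much as on the algorithm.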

Both the Random Forest and Linear Regression models agree on the significance of home runs per plate appearance as a pivotal predictor for RBIs per plate appearance. Interestingly, the prominence of on-base percentage is unexpectedly low compared to metrics such as slugging and home runs. This consistency in model findings underscores the critical role of home runs in influencing RBI outcomes, while the comparatively modest importance assigned to on-base percentage warrants further investigation.
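One way to follow up on the surprisingly low importance assigned to on-base percentage is permutation importance, which measures how much held-out performance drops when a feature's values are shuffled and is less prone to the biases of impurity-based importances. The sketch below uses synthetic `make_regression` data rather than the actual batting features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with four features, three of them informative.
X, y = make_regression(n_samples=400, n_features=4, n_informative=3,
                       noise=5.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Permutation importance: the drop in test-set score when each feature
# is shuffled, averaged over several repeats on held-out data.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```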

K-Nearest Neighbors Machine Learning Model: K-Nearest Neighbors (KNN) is a machine learning algorithm that predicts outcomes by finding the "k" closest data points to a given observation and using their values to make a prediction. In the realm of baseball statistics, KNN can be applied to predict metrics like RBIs per plate appearance. By identifying similar players based on their statistical profiles, KNN allows us to infer a player's performance by looking at their nearest neighbors. This method leverages the idea that players with similar performance characteristics tend to have similar outcomes, offering insights into how various metrics relate to RBIs within the context of baseball.

In [26]:
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create a StandardScaler instance to scale your data
scaler = StandardScaler()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Scale the feature matrices
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define a range of k values to test
k_values = range(1, 21)  # You can adjust this range as needed

# Initialize lists to store the average MSE for each k
mse_values = []

# Specify the number of folds for cross-validation
num_folds = 5  # Adjust as needed

# Iterate over different values of k
for k in k_values:
    # Initialize a K-nearest neighbors regressor with the current k value
    knn_regressor = KNeighborsRegressor(n_neighbors=k)

    # Perform K-fold cross-validation and calculate MSE scores
    mse_scores = -cross_val_score(knn_regressor, X_train_scaled, y_train, cv=num_folds, scoring='neg_mean_squared_error')

    # Calculate the average MSE across all folds for the current k
    avg_mse = np.mean(mse_scores)
    mse_values.append(avg_mse)

# Plot the MSE values for different values of k
plt.figure(figsize=(10, 6))
plt.plot(k_values, mse_values, marker='o', linestyle='-', color='b')
plt.title('Mean Squared Error vs. Number of Neighbors (k)')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Mean Squared Error (MSE)')
plt.xticks(k_values)  # Use the specified k values on the x-axis
plt.grid(True)
plt.show()
In [27]:
# Create a StandardScaler instance to scale your data
scaler = StandardScaler()

# Scale the feature matrices
X_scaled = scaler.fit_transform(X)

# Define the value of k you want to evaluate
k = 3

# Initialize a K-nearest neighbors regressor with k=3
knn_regressor = KNeighborsRegressor(n_neighbors=k)

# Perform K-fold cross-validation and get predicted values
y_pred = cross_val_predict(knn_regressor, X_scaled, y, cv=num_folds)

# Calculate MSE and R-squared for k=3
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Statistics for k = {k}:")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Statistics for k = 3:
Mean Squared Error: 0.00046618166695360533
R-squared: 0.4621318270034519

The KNN model with k = 3 (selected from the MSE-versus-k plot above) yields a Mean Squared Error of about 0.00047 and an R-squared of 0.46. This is somewhat worse than both the linear regression model (average R-squared of 0.62) and the tuned Random Forest model (R-squared of 0.50), suggesting that a distance-based approach captures the relationships between these predictors and RBIs per plate appearance less effectively on this dataset.

Adding More Data

The models above didn't quite show the desired results. To improve them, batting data for the past six years was downloaded, providing a larger dataset. The original idea behind using only 2023 data was that the game of baseball is always changing, so it may be best to build the models on recent data. However, a single season provides very limited data, so data for the last six years was fetched instead.

In [28]:
# Initialize a list to store the data for each year (except the current year)
batting_data = []

# Loop through the last 6 years (from 2018 to 2023)
for year in range(2018, 2024):
    # Read in the CSV file for the current year
    file_path = f"batting/{year}_batting.csv"
    df = pd.read_csv(file_path)
    
    # Append the data to the list
    batting_data.append(df)

# Concatenate the dataframes for each year
combined_data = pd.concat(batting_data)

# Specify the columns you want to aggregate
columns_to_aggregate = ['G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB']

Some columns, such as on-base percentage and slugging percentage, can't simply be aggregated by summing across seasons. Luckily, the counting columns necessary to calculate them manually are available.

In [29]:
# Group by 'Name' and aggregate the specified columns
agg_data = combined_data.groupby('Name')[columns_to_aggregate].sum().reset_index()

# Filter for players with over 1000 plate appearances
agg_data_filtered = agg_data[agg_data['PA'] > 1000]

# Calculate 'OBP' (On-Base Percentage) and 'SLG' (Slugging Percentage) using .loc
agg_data_filtered.loc[:, 'OBP'] = (agg_data_filtered['H'] + agg_data_filtered['BB'] + agg_data_filtered['HBP']) / agg_data_filtered['PA']
agg_data_filtered.loc[:, 'SLG'] = agg_data_filtered['TB'] / agg_data_filtered['AB']

# Manually input 'OPS' (On-base Plus Slugging)
agg_data_filtered['OPS'] = agg_data_filtered['OBP'] + agg_data_filtered['SLG']

# Print or further analyze the filtered aggregated data with fixed metrics
print(agg_data_filtered)
                    Name    G    PA    AB    R    H   2B  3B   HR  RBI  ...  \
3             AJ Pollock  617  2249  2056  284  520  106   8   97  301  ...   
13          Aaron Hicks#  578  2180  1844  310  439   66   9   73  258  ...   
14           Aaron Judge  642  2799  2342  467  664  107   1  196  436  ...   
24         Abraham Toro#  366  1309  1177  156  258   42   3   39  154  ...   
26    Adalberto Mondesí#  286  1157  1085  160  277   52  17   35  141  ...   
...                  ...  ...   ...   ...  ...  ...  ...  ..  ...  ...  ...   
2282     Yoshi Tsutsugo*  318  1077   939  114  197   41   4   34  134  ...   
2285       Yoán Moncada#  664  2809  2506  332  640  142  17   82  312  ...   
2288        Yuli Gurriel  723  2919  2671  349  745  172   7   77  371  ...   
2302     Zach McKinstry*  315  1035   932  117  202   41  10   25   89  ...   
2322       Óscar Mercado  407  1228  1131  157  261   58   7   34  132  ...   

       SO    TB  GDP  HBP  SH  SF  IBB       OBP       SLG       OPS  
3     461   933   38   23   1  23    8  0.306358  0.453794  0.760152  
13    486   742   24   10   3  15    7  0.347248  0.402386  0.749634  
14    775  1361   61   18   0  19   35  0.391568  0.581127  0.972696  
24    220   423   17   25   0   9    1  0.291062  0.359388  0.650450  
26    342   468   16    3  11   7    0  0.286085  0.431336  0.717421  
...   ...   ...  ...  ...  ..  ..  ...       ...       ...       ...  
2282  289   348   16    4   2   8    1  0.301764  0.370607  0.672371  
2285  808  1062   22   20   4  11    6  0.330367  0.423783  0.754150  
2288  340  1162   80   24   0  37    7  0.327509  0.435043  0.762552  
2302  264   338    7    6   7   5    1  0.283092  0.362661  0.645753  
2322  243   435   34   10   9   8    0  0.277687  0.384615  0.662303  

[399 rows x 23 columns]

The columns were renamed to match the other dataframe.

In [30]:
# Creating a dictionary to map the variables.
column_name_mapping = {
    'Age': 'age',
    'Lev': 'league',
    'Tm': 'team',
    'G': 'games',
    'PA': 'plate_appearances',
    'AB': 'at_bats',
    'R': 'runs',
    'H': 'hits',
    '2B': 'double',
    '3B': 'triple',
    'HR': 'home_run',
    'RBI': 'runs_batted_in',
    'BB': 'walk',
    'IBB': 'intentional_walk',
    'SO': 'strikeouts',
    'HBP': 'hit_by_pitch',
    'SH': 'sacrifice_hit',
    'SF': 'sacrifice_fly',
    'GDP': 'double_play',
    'BA': 'batting_average',
    'OBP': 'on_base_percentage',
    'SLG': 'slugging_percentage',
    'OPS': 'on_base_plus_slugging'
}

agg_data_filtered.rename(columns=column_name_mapping, inplace=True)
agg_data_filtered.head()
Out[30]:
Name games plate_appearances at_bats runs hits double triple home_run runs_batted_in ... strikeouts TB double_play hit_by_pitch sacrifice_hit sacrifice_fly intentional_walk on_base_percentage slugging_percentage on_base_plus_slugging
3 AJ Pollock 617 2249 2056 284 520 106 8 97 301 ... 461 933 38 23 1 23 8 0.306358 0.453794 0.760152
13 Aaron Hicks# 578 2180 1844 310 439 66 9 73 258 ... 486 742 24 10 3 15 7 0.347248 0.402386 0.749634
14 Aaron Judge 642 2799 2342 467 664 107 1 196 436 ... 775 1361 61 18 0 19 35 0.391568 0.581127 0.972696
24 Abraham Toro# 366 1309 1177 156 258 42 3 39 154 ... 220 423 17 25 0 9 1 0.291062 0.359388 0.650450
26 Adalberto Mondesí# 286 1157 1085 160 277 52 17 35 141 ... 342 468 16 3 11 7 0 0.286085 0.431336 0.717421

5 rows × 23 columns

In [31]:
# Create a scatter plot of hits vs. runs_batted_in
plt.figure(figsize=(10, 5))
plt.scatter(agg_data_filtered['hits'], agg_data_filtered['runs_batted_in'], alpha=0.5)
plt.title('Hits vs. Runs Batted In')
plt.xlabel('Hits')
plt.ylabel('Runs Batted In')
plt.grid(True)
plt.show()

# Create a scatter plot of runs_batted_in vs. home_run
plt.figure(figsize=(10, 5))
plt.scatter(agg_data_filtered['home_run'], agg_data_filtered['runs_batted_in'], alpha=0.5)
plt.title('Runs Batted In vs. Home Runs')
plt.xlabel('Home Runs')
plt.ylabel('Runs Batted In')
plt.grid(True)
plt.show()

# Create a scatter plot of on_base_plus_slugging (OPS) vs. runs_batted_in
plt.figure(figsize=(10, 5))
plt.scatter(agg_data_filtered['on_base_plus_slugging'], agg_data_filtered['runs_batted_in'], alpha=0.5)
plt.title('On-Base Plus Slugging (OPS) vs. Runs Batted In')
plt.xlabel('OPS')
plt.ylabel('Runs Batted In')
plt.grid(True)
plt.show()

Analyzing the New Scatter Plots

These new scatter plots, based on a larger dataset spanning six seasons, reveal interesting trends and offer a more comprehensive view of the relationships between key baseball statistics and player performance.

The relationships between the variables now show enhanced linearity, largely due to the increased number of data points from multiple seasons. This broader spectrum of player performance makes the improvement in linearity evident when compared to plots based on a single season's data.

The scatter plot comparing On-Base Plus Slugging (OPS) to Runs Batted In (RBI) reveals a moderately positive correlation. This indicates that as a player's OPS increases, their RBI count tends to rise as well. This highlights the importance of a player's ability to get on base and produce extra-base hits, which significantly contribute to RBI totals.

The next step involves normalizing these statistics per plate appearance to provide a more precise comparison. This normalization ensures a level playing field for all players, regardless of the number of plate appearances. By doing so, the analysis will yield more accurate conclusions about player performance and its correlation with key metrics.

By continually refining the analysis and leveraging a multi-season dataset, a more robust model for evaluating player performance in baseball can be developed.

In [32]:
# Normalizing variables
variables_to_normalize = ['hits', 'double', 'triple', 'home_run', 'runs_batted_in', 
                          'walk', 'intentional_walk', 'strikeouts', 'hit_by_pitch',
                          'sacrifice_hit', 'sacrifice_fly', 'double_play']

# Iterate through the variables and add normalized columns, then drop the original columns
for var in variables_to_normalize:
    normalized_col_name = f'{var}_per_pa'
    agg_data_filtered[normalized_col_name] = agg_data_filtered[var] / agg_data_filtered['plate_appearances']
    agg_data_filtered.drop(columns=[var], inplace=True)
In [33]:
# Define the x-variables to plot
x_variables = ['hits_per_pa', 'home_run_per_pa', 'on_base_plus_slugging']

# Create scatter plots for RBI per PA vs. the selected x-variables
for x_var in x_variables:
    plt.figure(figsize=(10, 5))
    plt.scatter(agg_data_filtered[x_var], agg_data_filtered['runs_batted_in_per_pa'], alpha=0.5)
    plt.title(f'RBI per PA vs. {x_var.capitalize()}')
    plt.xlabel(x_var.capitalize())
    plt.ylabel('RBI per PA')
    plt.grid(True)
    plt.show()

Analyzing the Relationships

The in-depth analysis of the relationship between Runs Batted In per Plate Appearance (RBI per PA) and key baseball statistics—Hits, Home Runs, and On-Base Plus Slugging (OPS)—has provided valuable insights into the factors influencing a player's offensive performance.

Home Runs per PA and RBI per PA: A standout finding is the strong connection between Home Runs per PA and RBI per PA. This relationship highlights the critical role of power hitting, particularly home runs, in driving in runs. The correlation suggests that players who excel at hitting home runs tend to accumulate RBIs at a higher rate. This observation aligns with previous machine learning and regression models, which consistently identified home runs as a key predictor of RBI production. The robustness of this relationship, even with a larger dataset, underscores the significant impact of home runs on run production.

Hits per PA and RBI per PA: Conversely, the relationship between Hits per PA and RBI per PA is less pronounced. While hits are essential for offensive success, this analysis indicates that their impact on RBI production may not be as substantial as power hitting. This suggests that accumulating hits alone does not necessarily lead to a higher RBI count.

OPS and RBI per PA: For OPS, which combines a player's on-base ability and slugging power, there is a balanced yet moderate connection with RBI production. This equilibrium highlights the importance of both getting on base and hitting for power. Previous machine learning and regression models also pointed to OPS as a valuable predictor of RBIs, and this observation with a larger dataset reaffirms its relevance.

Summary: The findings from this analysis resonate with insights from previous machine learning and regression models. The consistent relationship between home runs and RBIs, even with more data, emphasizes the pivotal role of power hitting in driving in runs. These insights deepen the understanding of player performance dynamics in baseball, highlighting the factors that significantly contribute to success on the field.

Linear Regression Model

In [34]:
# Define predictor variables and the dependent variable
predictor_vars = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = agg_data_filtered[predictor_vars]
y = agg_data_filtered['runs_batted_in_per_pa']

# Scale the predictor variables
scaler = MinMaxScaler()
X[predictor_vars] = scaler.fit_transform(X[predictor_vars])

# Add a constant column to X matrix for intercept
X = sm.add_constant(X)

# Perform k-fold cross-validation
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=1)

mse_scores = []
r2_scores = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model = sm.OLS(y_train, X_train)
    results = model.fit()
    
    y_pred = results.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    mse_scores.append(mse)
    r2_scores.append(r2)

# Calculate average MSE and R-squared across folds
avg_mse = np.mean(mse_scores)
avg_r2 = np.mean(r2_scores)

print(f"Average Mean Squared Error: {avg_mse}")
print(f"Average R-squared: {avg_r2}")

# Create a linear regression model
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

# Print the summary
print(results.summary())
Average Mean Squared Error: 0.00012198247368073569
Average R-squared: 0.7535595648443909
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     runs_batted_in_per_pa   R-squared:                       0.769
Model:                               OLS   Adj. R-squared:                  0.766
Method:                    Least Squares   F-statistic:                     327.3
Date:                   Thu, 01 Aug 2024   Prob (F-statistic):          8.66e-124
Time:                           23:12:48   Log-Likelihood:                 1238.7
No. Observations:                    399   AIC:                            -2467.
Df Residuals:                        394   BIC:                            -2448.
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   0.0700      0.002     33.478      0.000       0.066       0.074
runs                   -0.0103      0.003     -3.208      0.001      -0.017      -0.004
on_base_percentage     -0.0124      0.006     -2.192      0.029      -0.024      -0.001
slugging_percentage     0.0712      0.009      7.708      0.000       0.053       0.089
home_run_per_pa         0.0588      0.006      9.848      0.000       0.047       0.071
==============================================================================
Omnibus:                        0.233   Durbin-Watson:                   2.090
Prob(Omnibus):                  0.890   Jarque-Bera (JB):                0.108
Skew:                           0.024   Prob(JB):                        0.947
Kurtosis:                       3.065   Cond. No.                         27.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Leveraging More Data for Improved Predictions

The regression model's performance improved significantly when applied to the multi-season dataset. The average Mean Squared Error (MSE) decreased from roughly 0.00037 to 0.00012, and the average R-squared value increased from about 0.58 to 0.75.

This improvement highlights the value of a larger dataset. With more data points, the model gains a deeper understanding of the relationships between the predictor variables (runs, on-base percentage, slugging percentage, and home runs per plate appearance) and the dependent variable, runs batted in per PA.

More data enables the model to capture a broader range of player performances and reduces the risk of overfitting. In this context, it reaffirms the significance of power hitting, particularly home runs and slugging, over simply getting on base.

The regression results with the larger dataset provide quantitative evidence supporting the importance of power hitting in driving in runs. Notably, the coefficient on 'slugging_percentage' (0.0712) stands out, indicating a strong link between slugging and runs batted in, and the coefficient on 'home_run_per_pa' (0.0588) underscores the pivotal role of home runs in run production. By contrast, the small negative coefficient on 'on_base_percentage' suggests that, once power metrics are accounted for, simply getting on base adds little to RBI production.
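
The fitted equation can be applied by hand using the coefficients printed in the regression summary above. The player statistics below are purely hypothetical (and the scale of the 'runs' input is an assumption), chosen only to illustrate the arithmetic:

```python
# Applying the fitted OLS equation by hand, using the coefficients
# printed in the regression summary above. The player's statistics
# below are hypothetical, purely for illustration.
coefs = {
    "const": 0.0700,
    "runs": -0.0103,
    "on_base_percentage": -0.0124,
    "slugging_percentage": 0.0712,
    "home_run_per_pa": 0.0588,
}

# Hypothetical power hitter: OBP .350, SLG .500, ~5% home run rate
player = {
    "runs": 0.12,                  # runs scored per PA (assumed scale)
    "on_base_percentage": 0.350,
    "slugging_percentage": 0.500,
    "home_run_per_pa": 0.05,
}

rbi_per_pa = coefs["const"] + sum(coefs[k] * player[k] for k in player)
print(f"Predicted RBIs per plate appearance: {rbi_per_pa:.4f}")
```

Over a full season of roughly 600 plate appearances, a per-PA figure in this range corresponds to an RBI total on the order of 60, which is a plausible magnitude for an everyday power hitter.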

Random Forest Model

In [ ]:
predictor_vars = ['runs', 'on_base_percentage', 'slugging_percentage', 'home_run_per_pa']
X = agg_data_filtered[predictor_vars]
y = agg_data_filtered['runs_batted_in_per_pa']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [25, 50, 100, 150],  
    'max_depth': [None, 5, 10, 20],  
    'min_samples_split': [2, 5, 10, 20]  
}

# Initialize GridSearchCV with the Random Forest model and parameter grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best model and its hyperparameters
best_rf_model = grid_search.best_estimator_

# Make predictions on the test data using the best model
y_pred = best_rf_model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best Model Parameters: {grid_search.best_params_}")
print(f"Mean Squared Error with Best Model: {mse}")
print(f"R-squared with Best Model: {r2}")

# Choose an arbitrary tree from the Random Forest (e.g., first tree)
chosen_tree = best_rf_model.estimators_[0]

plt.figure(figsize=(20, 10))
plot_tree(chosen_tree, feature_names=X_train.columns, filled=True, rounded=True)
plt.show()

feature_importances = best_rf_model.feature_importances_
sorted_indices = np.argsort(feature_importances)[::-1]  # Sort indices in descending order

for idx in sorted_indices:
    print(f"Feature: {X_train.columns[idx]}, Importance: {feature_importances[idx]}")
Best Model Parameters: {'max_depth': None, 'min_samples_split': 20, 'n_estimators': 100}
Mean Squared Error with Best Model: 0.00012422084331287662
R-squared with Best Model: 0.7395001595887596

The Random Forest Regressor model also exhibited notable improvement when applied to the larger multi-season dataset. In the previous analysis using a one-year dataset, the Mean Squared Error (MSE) was approximately 0.00058, and the R-squared value stood at about 0.41. At that time, the importance score for home runs was 0.4.

With the addition of more data, the MSE fell to roughly 0.00012, and the R-squared value increased to about 0.74. Interestingly, the importance score for 'home_run_per_pa' surged to 0.81, highlighting its dominant role in predicting runs batted in. On the other hand, the importance of 'slugging_percentage' decreased to 0.14.

These findings further emphasize the importance of home runs in run production. The RandomForestRegressor model's enhanced performance with more data corroborates that power hitting, specifically through home runs, remains a pivotal factor in a player's ability to contribute significantly to offensive success in baseball.
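
Impurity-based importances like those reported above can be biased when features are correlated, as slugging and home-run rate are. Permutation importance, which measures how much the test-set score drops when one feature's values are shuffled, is a useful cross-check. The sketch below runs it on synthetic data (the feature names and data are invented for illustration); in the notebook it would be applied to `best_rf_model`, `X_test`, and `y_test`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: feature 0 drives the target, feature 1 barely does
rng = np.random.default_rng(0)
n = 400
X = rng.random((n, 2))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.05, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the average score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for name, imp in zip(["strong_feature", "weak_feature"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

If the permutation ranking agrees with the impurity-based ranking, as it should here for the dominant feature, that adds confidence that the home-run signal is genuine rather than an artifact of correlated predictors.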

Evaluating Model Performance

Random Forest with 6-Season Data:

  • The Random Forest model with 6-season data achieved impressive results, with the best model parameters being max_depth=None, min_samples_split=20, and n_estimators=100.
  • The Mean Squared Error (MSE) of approximately 0.000123 indicates that the model's predictions are quite close to the actual values, highlighting its predictive accuracy.
  • The R-squared value of around 0.741 indicates that the model explains a substantial portion of the variance in the data, further underscoring its effectiveness.

Linear Regression with 6-Season Data:

  • Linear Regression with 6-season data also performed well, with an average Mean Squared Error of approximately 0.000122 and an average R-squared of about 0.754.
  • The R-squared value indicates that this linear regression model explains a significant portion of the variability in the data, making it a suitable choice for prediction.

KNN with 1-Season Data:

  • KNN with 1-season data exhibited a comparatively higher Mean Squared Error, averaging around 0.000403, suggesting that its predictions may not be as accurate as the 6-season data models.
  • However, the average R-squared value of approximately 0.521 indicates that KNN provides some explanatory power, although it falls short of the performance of the 6-season data models.

Random Forest with 1-Season Data:

  • Random Forest with 1-season data performed reasonably well, with the best model parameters being max_depth=10, min_samples_split=10, and n_estimators=50.
  • The Mean Squared Error of approximately 0.000580 indicates that it predicts reasonably well within the context of the 1-season dataset.
  • However, the R-squared value of around 0.407 suggests that this model explains only a moderate portion of the variance in the data.

Linear Regression Model 2 with 1-Season Data:

  • Linear Regression Model 2 with 1-season data achieved an average Mean Squared Error of approximately 0.000368.
  • The average R-squared of about 0.581 indicates moderate explanatory power, the strongest among the 1-season models but well below the 6-season results.

Linear Regression Model 1 with 1-Season Data:

  • Linear Regression Model 1 with 1-season data showed an average Mean Squared Error of about 0.000372.
  • The average R-squared of around 0.560 suggests that it provides limited explanatory power within the 1-season data context.

In summary, both the Random Forest and Linear Regression models trained on 6-season data stand out as strong performers, combining low prediction error with the ability to explain a substantial portion of the variance in the data. With 1-season data, performance drops across the board: KNN exhibits the highest MSE, while the Random Forest and Linear Regression models produce reasonable but less accurate predictions. The most suitable model ultimately depends on the specific context and requirements of the analysis, along with further exploration and validation.
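
Comparisons like the ones summarized above can be reproduced compactly by scoring each candidate model with the same K-fold split. The sketch below uses synthetic regression data as a stand-in for the notebook's features, so the printed scores are illustrative, not the report's numbers:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the four-feature hitting dataset
X, y = make_regression(n_samples=400, n_features=4, noise=5.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

# Score every model on identical folds so the comparison is fair
scores = {}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    scores[name] = r2
    print(f"{name}: mean R-squared = {r2:.3f}")
```

Using the same `cv` object for every estimator ensures each model sees identical train/test folds, which is what makes the resulting scores directly comparable.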

Exploratory Data Analysis (EDA)

To understand the relationships between player statistics and Runs Batted In (RBIs) per plate appearance, a comprehensive Exploratory Data Analysis (EDA) was conducted. By employing visualizations and data exploration techniques, the analysis aimed to uncover insights that would guide subsequent modeling efforts.

Positive Correlations: The EDA revealed consistent positive correlations between various variables and RBIs per plate appearance. These correlations provided valuable initial insights into how player statistics relate to RBIs, acting as signals for predictive modeling.
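
Correlations like those examined in the EDA can be computed directly with pandas. The sketch below builds a tiny synthetic frame standing in for the notebook's aggregated data (the values are invented; only the `corr()` pattern is the point):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the aggregated hitting data: slugging with a
# noisy positive relationship to RBIs per PA, purely for illustration
rng = np.random.default_rng(1)
slg = rng.uniform(0.3, 0.6, 200)
df = pd.DataFrame({
    "slugging_percentage": slg,
    "runs_batted_in_per_pa": 0.15 * slg + rng.normal(0, 0.01, 200),
})

# Correlation of every column with the target, sorted strongest first
corrs = df.corr()["runs_batted_in_per_pa"].drop("runs_batted_in_per_pa")
print(corrs.sort_values(ascending=False))
```

On the real data, ranking the columns of `df.corr()` this way gives a quick first pass at which statistics are the most promising predictors before any modeling.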

Machine Learning and Regression Models

Dominant Role of Home Runs: The machine learning and regression models highlighted the pivotal role of home runs in predicting RBIs. Both Random Forest and Linear Regression models underscored the significance of home runs in a player's ability to drive in runs, reaffirming the conventional wisdom that "home runs drive the offense."

Complexity of On-Base Percentage (OBP): While metrics like slugging and home runs were significant, on-base percentage emerged as less influential. This may be due to the diverse offensive strategies employed by players. Contact hitters who emphasize on-base percentage might have different RBI profiles compared to power hitters.

Future Directions and Considerations

The project sets the stage for further exploration and offers several avenues for future research. One promising direction is the incorporation of additional variables or the engineering of new ones. For example, metrics like batting average with runners in scoring position could provide a more nuanced understanding of a player's clutch performance.

Exploring New Variables: Team on-base percentage (OBP) was also considered, but the results showed little linear correlation with RBIs. This doesn't preclude the possibility of incorporating other meaningful variables in future analyses.

Ongoing research is encouraged to refine models and explore novel dimensions of player performance. Expanding the scope of analysis and considering diverse factors can deepen the understanding of baseball statistics and their impact on RBIs.

Real-World Application

Strategic Player Acquisition: Baseball teams in need of an offensive boost can leverage this model to make informed decisions when acquiring new players. By inputting a hitter's statistics into the model, teams can project how many RBIs they are likely to contribute per plate appearance. While RBIs alone don't determine wins, the project's emphasis on the importance of home runs in driving runs suggests that struggling teams seeking offensive firepower may prioritize recruiting players with a knack for hitting home runs, potentially rejuvenating their offensive capabilities.
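
Scaling a per-PA prediction to a season total is simple arithmetic; the sketch below assumes an illustrative per-PA figure and plate-appearance count rather than actual model output:

```python
# Hypothetical scouting sketch: scale a per-PA prediction (e.g. from
# the regression above) to a season total. Both numbers below are
# illustrative assumptions, not model output.
predicted_rbi_per_pa = 0.103   # example per-PA prediction
expected_pa = 600              # typical full season for an everyday hitter

projected_rbis = predicted_rbi_per_pa * expected_pa
print(f"Projected season RBIs: {projected_rbis:.0f}")
```

Projections like this give a front office a concrete, comparable number when weighing candidate acquisitions against one another.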
